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Abstract 

We present and study two communication-efficient algorithms for distributed statistical op- 
I timization on large-scale data sets. The first algorithm is a standard averaging method that 

distributes the N data samples evenly to m machines, performs separate minimization on each 
subset, and then averages the estimates. We provide a sharp analysis of this average mixture 
^ 1 ■ algorithm, showing that under a reasonable set of conditions, the combined parameter achieves 

mean-squared error that decays as 0{N~^ + {N/m)~^). Whenever m < ^/N, this guarantee 
matches the best possible rate achievable by a centralized algorithm having access to all N 
samples. The second algorithm is a novel method, based on an appropriate form of bootstrap 
subsampling. Requiring only a single round of communication, it has mean-squared error that 
^ " decays as 0{N^^ + {N/m)~^), and so is more robust to the amount of parallelization. We 

also provide experimental evaluation of our methods, investigating their performance both on 
simulated data and on a large-scale regression problem from the internet search domain. In 

■ particular, we show that our methods can be used to efficiently solve an advertisement pre- 
\ diction problem from the Chinese SoSo Search Engine, which involves logistic regression with 

iV « 2.4 X 10* samples and d k. 740,000 covariates. 

^ ■ 1 Introduction 
(N 

Many procedures for statistical estimation are based on a form of (regularized) empirical risk min- 
imization, meaning that a parameter of interest is estimated by minimizing an objective function 
\r2 \ defined by the average of a loss function over the data. Given the current explosion in the size 

■ and amount of data available in in statistical studies, a central challenge is to design efficient algo- 



rithms for solving large-scale problem instances. In a centralized setting, there are many procedures 
for solving empirical risk minimization problems, among them standard convex programming ap- 



proaches [3] as well as stochastic approximation and optimization algorithms j27l. I 111. l2Cll|. When 
the size of the dataset becomes extremely large, however, it may be infeasible to store all of the 
data on a single computer, or at least to keep the data in memory. Accordingly, the focus of this 
paper is the study of some distributed and communication-efhcient procedures for empirical risk 
minimization. 
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Recent years have witnessed a flurry of research on distributed approaches to solving very large- 
scale statistical optimization problems. Although we cannot survey the literature adequately — 
the list of papers 0, Si, [13, 0, S, Q, 0, B and references therein contain a sample of relevant 
work — we touch on a few important themes here. It can be difficult within a purely optimization- 
theoretic setting to show explicit benefits arising from distributed computation. In statistical 
settings, however, distributed computation can lead to gains in statistical efficiency, as shown by a 
number of authors [l|,0, 26, ^. Within the family of distributed algorithms, there can be significant 
differences in communication complexity: different computers must be synchronized, and when 
the dimensionality of the data is high, communication can be prohibitively expensive. It is thus 
interesting to study distributed estimation algorithms that require fairly limited synchronization 
and communication, while still enjoying the greater statistical accuracy that is usually associated 
with a larger dataset. 

With this context, perhaps the simplest algorithm for distributed statistical estimation is what 
we term the average mixture (Avgm) algorithm. It is an appealingly simple method: given m 
different machines and a dataset of size A^, assign to each machine a (distinct) dataset of size 
n = N/m, have each machine i compute the empirical minimizer 0i on its fraction of the data, and 
then average all the parameter estimates 6i across the machines. This approach been studied for 
some classification and estimation problems by Mann et al. [3] and McDonald et al. as well 
as for certain stochastic approximation methods by Zinkevich et al. [sO]. Given an empirical risk 
minimization algorithm that works on one machine, the procedure is straightforward to implement 
and is extremely communication efficient, requiring only a single round of communication. It is 
also relatively robust to possible failures in a subset of machines and/or differences in speeds, since 
there is no repeated synchronization. To the best of our knowledge, however, no work has shown 
rigorously that the Avgm procedure has greater statistical efficiency than the naive approach of 
using n = N/m samples on a single machine. 

This paper makes three main contributions. First, in Section[3]we provide a sharp analysis of the 
Avgm algorithm, showing that under a reasonable set of conditions on the statistical risk functional, 
it can indeed achieve substantially better rates than the naive approach. More concretely, we 
provide bounds on the mean-squared error that decay as 0{{nm)~^ -|-n~^). Whenever the number 
of machines m is less than the number of samples n per machine, this guarantee matches the best 
possible rate achievable by a centralized algorithm having access to all N = nm samples. In the 
special case of optimizing log likelihoods, the pre-factor in our bound involves the trace of the Fisher 
information, a quantity well-known to control the fundamental limits of statistical estimation. We 
also show how the result extends to certain stochastic programming approaches, but we defer proofs 
of all results to later sections and appendices. 

Our second contribution is to develo p a novel extension of simple averaging; it is based on 
an appropriate form of bootstrap [lol. |22|. which we refer to as the bootstrap average mixture 
(Bavgm) approach. At a high level, the Bavgm algorithm distributes samples evenly among m 
processors or computers as before, but instead of simply returning the empirical minimizer, each 
processor further subsamples its own dataset in order to estimate the bias of its own estimate, 
and returns a bootstrap-corrected estimate. We then prove that the Bavgm algorithm has mean- 
squared error decaying as 0{m~^n~^ -\- n~^). As long as m < n^, the bootstrap method again 
matches the centralized gold standard up to higher order terms. These rates are sharper than any 
previous work, and moreover are minimax optimal up to constant factors. 

Our third contribution is to perform a detailed empirical evaluation of both the AvGM and 
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Bavgm procedures, which we present in Sections S] and [5l Using simulated data from normal 
and non-normal regression models, we explore the conditions under which the Bavgm algorithm 
yields better performance than the AvGM algorithm; in addition, we study the performance of 
both methods relative to an oracle baseline that uses all samples. We also study the sensitivity 
of the algorithms to the number of splits m of the data, and in the Bavgm case, we investigate 
the sensitivity of the method to the amount of resampling. These simulations show that both 
AvGM and Bavgm have favorable performance, even when compared to the unattainable "gold 
standard" procedure that has access to all N samples. In Section [5l we complement our simulation 
experiments with a large logistic regression experiment that arises from the problem of predicting 
whether a user of a search engine will click on an advertisement. This experiment is large enough — 
involving N ~ 2.4 x 10^ samples in d ~ 740, 000 dimensions with a storage size of approximately 
55 gigabytes — that it cannot be solved efficiently on one machine. Consequently, a distributed 
approach is essential to take full advantage of this data set. Our experiments on this real-world 
problem show that the resampling and correction that the Bavgm method provides substantial 
performance benefits over naive solutions as well as the averaging algorithm AvGM. We provide 
our conclusions and discussion in Section [6l 



2 Problem set-up 

In this section, we formally describe our setting and the statistical optimization problems we con- 
sider, providing necessary background and defining notation as needed. We begin by setting up our 
decision-theoretic framework for empirical risk minimization, after which we describe our algorithms 
and the assumptions we require for our main theoretical results. 



2.1 Empirical risk minimization 

Let {/(•; x), X E X} be a collection of real- valued and convex loss functions with domain containing 
the convex set @ C M"^. Let P be a probability distribution over the sample space X, and define 
the population risk functional Fq : — t- M 

Foi9):=Ep[fi9;X)]= [ f{9-x)dP{x). (1) 

Our goal is to estimate the parameter vector minimizing the population risk, namely the quantity 

0* := argminFo(0) = / f{e;x)dP{x), (2) 
See Jx 

which we assume to be unique. In practice, the population distribution P is unknown to us, but 
we have access to a collection S of samples from the distribution P. Empirical risk minimization 
is based on estimating 6* by solving the optimization problem 



G argmm 



1^1 ... 



Throughout the paper, we impose some standard regularity conditions on the parameter space, 
the risk function Fq, and the instantaneous loss functions f{-;x) : — t- M. These conditions are 



natural and standard in classical statistical analysis of M-estimators (e.g., 15|, [l^). Our first 



assumption deals with the relationship of the parameter space to the optimal parameter 9* 
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Assumption A (Parameters). The parameter space G C M*^ is a compact convex set, with 
6* G int© and i2-radius R = maxg^Q \\6 — 6*\\2. 

In addition, the risk function is required to have some amount of curvature. We formalize this 
notion as follows: 

Assumption B (Local strong convexity). There exists some A > such that the Hessian matrix 
ofFo satisfies V^Fo{e*) >z \h>,d- 

Here V^Fo(^) denotes the dx d Hessian matrix of the population objective Fq evaluated at 9, and 
we use ^ to denote the positive semidefinite ordering (i.e., A ^ B means that A — B \s positive 
semidefinite.) This local condition is milder than a global strong convexity condition and is required 
to hold only for the population risk Fq evaluated at 9* . Of course, some type of curvature of the 
risk is required to consistently estimate the parameters 9* . (Without some local curvature, there 
may be no unique optimal parameter 9* , so the model would be — in a sense — ill-specified.) 

2.2 Averaging methods 

In the distributed setting, we are given a dataset of = mn samples, drawn i.i.d. according to the 
initial distribution P, which we divide evenly and randomly among a total of m processors. (For 
simplicity, we assume the total number of samples is a multiple of m.) For i = 1, . . . , m, we let Si^i 
denote the data set assigned to processor i; by construction, it is a collection of n samples drawn 
i.i.d. according to P, and the samples in subsets Si^i are Sij are independent for i / j. In addition, 
for each processor i we define the (local) empirical distribution Pi^i and empirical objective F\^i via 

^i'-=r^E'^-' Fi,,(0):=^ ^ /(0;x). (4) 

With this notation, the AvGM algorithm is very simple to describe: 

Average mixture algorithm 

(1) For each i G {1, . . . ,m}, processor i uses its local dataset Si^i to compute the local empirical 
minimizer 

9i^i e argmin | V f{9- x) \. (5) 

' V ' 

(2) These m local estimates are then averaged together — that is, we compute 

^ m 

9i = -^9i,i. (6) 

i=l 

The bootstrap average mixture (Bavgm) algorithm is based on an additional level of sampling 
on top of the first, involving a fixed subsampling rate r G [0,1]. In particular, it consists of the 
following three steps: 
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Bootstrap average mixture algorithm 



(1) Each processor i draws a subset (5*2,1 of size [rn] by sampUng uniformly at random without 
replacement from its local data set Si^i. 

(2) Each processor i computes both the local empirical minimizers 9i^i from equation ([5]), and 
the empirical minimizer 



(3) In addition to the previous average the Bavgm algorithm computes the bootstrap average 
^2 := ^ X^i^i ^2,15 and then returns the weighted combination 



The intuition for the weighted estimator ([8]) is similar to that for standard bias correction 
procedures using the bootstrap [13] • Roughly speaking, if bo = 9* — 9i is the bias of the first 
estimator, then we may approximate bo by the bootstrap estimate of bias bi = 9i — 92- Then, since 
9* = 9i + bQ, we use the fact that 6i ~ bo to argue that 9* = 9i + bQ ^ 9i + bi. The re-normalization 
is needed to enforce that the relative "weights" of 9i and 92 sum to 1. 

The goal of this paper to understand under what conditions — and in what sense — the estima- 
tors d^D and ([5]) approach the oracle performance, by which we mean the error of a centralized risk 
minimization procedure that is given access to all N = nm samples. 

Notation: Before continuing, we define the remainder of our notation. We use £2 to denote the 
usual Euclidean norm \\9\\2 = iYl'j=i^j)'^ ■ The ^2-operator norm of a matrix A G ]^'^i>^'^2 jg j^g 
maximum singular value, defined by 



(If F is not differentiable, we may replace VF with any subgradient of F.) We let <Si denote the 
Kronecker product, and for a pair of vectors u,v, we define the outer product u v = uv'^ . For 
a three-times differentiable function F, we denote the third derivative tensor by V^F, so that for 
each u G domF the operator V^F{u) : M'^^'^ — t- R"^ is linear and satisfies the relation 




(7) 




(8) 





We denote the indicator function of an event £ by l(^), which is 1 if is true and otherwise. 
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3 Theoretical results 



Having described the AvGM and Bavgm algorithms, we now turn to describing their statistical 
properties, presenting our two main theorems in this section. After each theorem, we give some 
commentary and a few corollaries to help to better explain the behavior of the algorithms, as well 
as comparing our convergence results to previous work. To guarantee good estimation properties of 
our algorithms, we require regularity conditions on the empirical risks Fi and F2. It is simplest to 
state these in terms of the sample functions f{9; x), and we note that, as with Assumption |Bl these 
are only required to hold locally around the optimal point 9*, in particular within some Euclidean 
bah U = {9 €W^\ \\9* - 6*112 < p} C G, where p G (0, 1) is the ball radius. 

Assumption C (Smoothness). There are finite constants G,H such that the first and the second 
partial derivatives of f exist and satisfy the bounds 

E[\\Vf{9;X)\\l]<G^ and E[\\\V^f{9;X)-V^Fo{9)\\\l]<H^ for all 9 eU. 

In addition, for any x £ X, the Hessian matrix S/'^f{9;x) is L{x)-Lipschitz continuous, meaning 
that 

\\V'^f{9'-x)-V'^f{9-x)\\^<L{x)\\9' -9\\^ for all 9,9' , (9) 
where, for some finite constant L <oo, we have E[L{Xf] < and E[{L{X) - E[L{X)])^] < . 

To be clear, it is actually necessary to impose some type of smoothness condition on the Hessian 
matrix, as in the Lipschitz condition ([9]), in order for averaging methods to work, as we now 
demonstrate. In fact, this example underscores the difficulty of proving that the AvGM algorithm 
achieves better mean-squared error than single-machine strategies. 

Example (Necessity of Hessian conditions). Let X be a Bernoulli variable with parameter ^, and 
consider the loss function 

... ^ f^^-^ ifx = 

/(^;^)= .21 -f 1 (10) 

i^l(0<o)+^ ifx = l, 

where l(£«o) is the indicator function for the event {9 < 0}. The associated population risk is 
-^o(^) = ^(^^ + ^^l(e<o)- Since \Fq{w) — Fq{v)\ < 2\ut — v\, the population risk is strongly convex 
and smooth, but it has a discontinuous second derivative. By inspection, the unique minimizer 
of the population risk is 9* = 0, and by an asymptotic expansion we have that E[0i^j] = il(n~2). 
(See Appendix [D] for details of this step.) Consequently, the bias of 9i is Q{n^2), and the AvGM 
algorithm using N = mn observations must suffer mean squared error E[{9i — 9*)"^] = Q{n~^). 
Thus, we see that some type of Hessian smoothness is necessary for fast rates. 

Although the previous example establishes necessity, it is in some sense pathological: both the 
smoothness condition given in Assumption [C] and the local strong convexity condition given in 
Assumption [B]are relatively innocuous for practical problems. For instance, both conditions will 
hold for standard forms of regression, such as linear and logistic, as long as the population data 
covariance matrix is not rank deficient and the data has suitable moments. Moreover, in the linear 
regression case, one has L = 0. 
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3.1 Bounds for simple averaging 

Our setup complete, we now turn to giving our theoretical results. Our first theorem gives the con- 
vergence rate of the AvGM procedure. We recall that 6* denotes the minimizer of the population 
objective function Fq, and that for each i £ {1, . . . ,m}, we use Si to denote a dataset of n inde- 
pendent samples. Recall that for each i, we use 9i G argmin{- ^ f{6;x)} to denote a minimizer 

of the empirical risk for the dataset Si, and we define the averaged vector 6 = The 
following result bounds the mean-squared error between this averaged estimate and the minimizer 
9* of the population risk. 

Theorem 1. Under Assumptions\^through\^ there exists a finite numerical constant C such that 



E 







|9-9*||2 






nm 



\v'^Fo{e*)-^vf{e*-x)\ 



(11) 



+ 



c 



2 log d + L^G^) E 



^*r^yf{e*-x)\ 



A simple (but somewhat weaker) corollary of Theorem [T] makes it easier to parse. In particular, 
note that 



V^Fo(r)-V/(r;x) 



< 



\v^Fo{e*) 



l|V/(r;x)||, 



(ii) 1 



A 



< 



l|V/(r;x)|| 



2 ' 



(12) 



where step (i) follows from the inequality l^xlg < 



12' 



valid for any matrix A and vector x; 



and step (ii) follows from Assumption|Bl In addition, Assumption [Cl implies E[|| V/(0*; X) Hg] < G^, 
and so, putting together the pieces, we have established the following corollary. 

Corollary 1. Under the same conditions as Theorem{l\ 



E 



< 



9^2 nc^ 

+ ^ (H'^ log d + L^G^) + 0(m- V-2) + 0(n-3) 



A2 



nm 



(13) 



This upper bound shows that the leading term decays proportionally to (nm)~^, with the pre- 
factor depending inversely on the strong convexity constant A and growing proportionally with the 
bound G on the loss gradient. Although easily interpretable, the upper bound (jl3p can be loose, 
since it is based on the relatively weak bound (I12p (i). 



Note that the leading term in our original upper bound (jlip involves the product of V f{9*; X) 
with the inverse Hessian. In many statistical settings, including the problem of linear regression, 
the effect of this matrix-vector multiplication is to perform some type of standardization. When 
the loss f{-',x) : O — 7- M is actually the negative log-likelihood £{x \ 0) for a parametric family of 
modelsiPe}, we can make this intuition precise. In particular, under suitable regularity conditions 
(e.g., |l5l . Chapter 6]), we can define the Fisher information matrix 



i{e*) 



E 



vi{x I e*)vi{x I e 



E[v^^(x I e*)]. 



Recalling that N = mn is the total number of samples available, then under our assumptions, 
standard minimax theory guarantees for any estimator 6^^ based on N samples, essentially 



liminf iVE 



> tr(/(^*) 
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For instance, see Theorem 8.11 in van der Vaart 29|] for a result of this type. (A formal statement 
requires taking a type of supremum over the true parameter 0*.) In connection with Theorem [H 
we obtain: 

Corollary 2. In addition to the conditions of Theorem\^ suppose that the loss functions f{-',x) 
are the negative log-likelihood i{x \ 6) for a parametric family {Pe, 9 G &}. Then there is a finite 
numerical constant C such that 



E 



< — tr(/(6l 



+ 



Cm^ tr{I{e 



log d + L^G^) + 0{mN-^). 



Proof Rewriting the log-likelihood in the notation of Theorem [H we have V£(x | ^*) = V/(0*; x) 
and all we need to note is that 



I{9 



I{9*)-^V£{X I 9*)Vi{X I 9*)^I{9*y^ 
{V^Fo{9*y'Vf{9*;X)) {V^Fo{9*r'Vf{9*;X)) 



Now apply the linearity of the trace and use the fact that tr(nn^) 



\u\ 



□ 



Except for the factor of two in the bound, Corollary [2] shows that Theorem [T] essentially achieves 
the best possible result. The important aspect of our bound, however, is that we obtain this 
convergence rate without calculating an estimate on all N = mn samples: instead, we calculate m 
independent estimators, and then average them to attain the convergence guarantee. We remark 
that an inspection of our proof shows that, at the expense of worse constants on higher order terms, 
we can reduce the factor of 2/mn on the leading term in Theorem [T] to (1 + c) /mn for any constant 
c > 0; as made clear by Corollary [21 this is essentially unimprovable. 



3.2 Bounds for bootstrap mixture averaging 

For small m, Theorem [1] and Corollary [1] show that the convergence rate of the AvGM algorithm is 
mainly determined by the first term in the bound (|lip . which is at most In contrast, when 

the number of processors m grows, the second term in the bound (jll|) . in spite of being 0{n~'^), 
may have non- negligible effect. This issue will be exacerbated when the local strong convexity pa- 
rameter A of the risk Fq is close to zero or the Lipschitz continuity constant H of Vf{9; x) is large. 
This concern motivated our development of the bootstrap average mixture (Bavgm) algorithm, 
and we now turn to its analysis. 

Due to the additional randomness introduced by the bootstrap algorithm Bavgm, its analysis 
requires an additional smoothness condition. In particular, recalling the Euclidean p-neighborhood 
U of the optimum 9*, we require that the loss function / is (locally) smooth through its third 
derivatives. 

Assumption D (Strong smoothness). The third derivatives of f are Lipschitz continuous, meaning 
that for each x £ X, there is a constant M{x) such that 

II (VV(^; x) - V^/(0'; x)) {u(^u)\\^< M{x) \\e - e'W^ \\u\\l for all 9, 9' G U, and u G W^, 

where E[M^(X)] < for some constant M < oo. 
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It is easy to verify that Assumption [P] holds for least-squares regression (with M = 0); it also holds 
for logistic regression problems whose covariates have finite fourth moments. 

We now state our second main theorem, which shows that the use of bootstrap samples to 
reduce the bias of the AvGM algorithm yields improved performance. 

Theorem 2. Under Assumptions lA\ through [Pi the output ^bavgm = (^i — r^2)/(l — r) of the 
bootstrap Bavgm algorithm has mean-squared error bounded as 



E 



I C'Bavgm ~ P 1 1 2 



2 + 3r 1 

(1 — rj^ nm 



v^Fo{e*)''vf{e*-x)\ 



Comparing the conclusions of Theorem [2] to those of Theorem [H we see that the the 0{n~'^) 
term in the bound (jlip has been eliminated. The reason for this elimination is that subsampling 
at a rate r reduces the bias of the Bavgm algorithm to 0{n~^); the bias of the AvGM algorithm 
induces terms of order n"^ in Theorem [TJ Theorem [2] suggests that the performance of the Bavgm 
algorithm is affected by the subsampling rate r; a rough estimate for r might be to take r = y/m/n 
so long as m < n^. Roughly, when m becomes large we increase r, since the bias of the independent 
solutions may increase and we enjoy averaging affects from the Bavgm algorithm. When m is small, 
the Bavgm algorithm appears to provide limited benefits. We leave as an intriguing open question 
whether computing multiple bootstrap samples at each machine can yield improved performance 
or reduce the variance of the Bavgm procedure, and whether using estimates based on resampling 
the data with replacement — rather than subsampling without replacement — can yield improved 
performance. 

3.3 Time complexity 

In practice, the exact empirical minimizers assumed in Theorems [T] and [2] may be unavailable. 
Instead, we need to use a finite number of iterations of some optimization algorithm in order to 
obtain reasonable approximations to the exact minimizers. In this section, we sketch an argument 
that shows that both the AvGM algorithm and the Bavgm algorithm can use such approximate 
empirical minimizers, and as long as the optimization error is sufficiently small, the resulting 
averaged estimate achieves the same order-optimal statistical error. Here we provide the arguments 
only for the AvGM algorithm; the arguments for the Bavgm algorithm are completely similar. 

More precisely, suppose that each processor runs a finite number of iterations of some opti- 
mization algorithm, thereby obtaining the vector 0[ as an approximate minimizer of the objective 
function Fi^i, and hence an approximation to 9i. Let &' = ^ Sl^i ^'i denote the average of these 
approximate minimizers, which corresponds to the output of the approximate AvGM algorithm. 
With this notation, we have 



E 



< 2E[||6l - 61*112] -F2E 



(a) ,,9 ,,9 

< 2E[^-r 2] + 2E[ a (14) 



|2J I — Liri -i||2J 

where step (i) follows by triangle inequality (and the elementary bound (a -|- 6)^ < 20? + 26^), and 
step (ii) follows by Jensen's inequality. Consequently, suppose that each processor i runs enough 
iterations to obtain an approximate minimizer such that 

n\\e[-6,\\l] = 0{{mnr^). (15) 
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When this condition holds, the bound (I14p shows that the average ^ of the approximate minimizers 
shares the same convergence rates provided by Theorem [TJ 

We now consider how long it takes to compute an approximate minimizer Q\ satisfying condi- 
tion (lisp . In particular, assuming processing one sample requires one unit of time, we claim that 
this computation can be performed in time log (mn)). In particular, the following two-stage 
strategy, involving a combination of stochastic gradient descent (see the following subsection for 
more details) and standard gradient descent, has this complexity: 

(1) First, as shown in the proof of Theorem [H with high probability, the empirical risk F\ is 
strongly convex in a ball Bp{Q\) of constant radius p > around d\. Consequently, perform- 
ing stochastic gradient descent on F\ for C'(log^(mn)//3^) iteratio ns y ields an approximate 
minimizer that falls within Bp{9i) with high probability (e.g., see [2d . Proposition 2.1]). 

(2) This initial estimate can be further improved by a few iterations of standard gradient descent. 
Under local strong convexity of the objective function, gradient descent is known to converge 
at a geometric rate Chapter 9], so that C'(log(l/e)) iterations will reduce the error to order 
e. In our case, we have e = (mn)~^, and since each iteration of standard gradient descent 
requires 0{n) units of time, a total of 0{n\og{mn) time units are sufficient to obtain a final 
estimate 6[ satisfying condition (jlSp . 

Overall, we conclude that the speed-up of the AvGM (and similar algorithms) over the naive 
approach of processing all N = mn samples on one processor is at least of order m/log(A^). 



3.4 Stochastic gradient descent with averaging 



The previous strategy involved a combination of stochastic gradient descent and standard gradient 
descent. In many settings, it may be appealing to use only a stochastic gradient algorithm, due 
to their ease of their implementation and limited computational requirements. In this section, we 
describe an extension of Theorem [T] to the case in which each machine computes an approximate 
minimizer using only stochastic gradient descent. 

Stochastic gradient algorithms have a lengthy history in both statistics and optimization 
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23|, |20|, |2j] . Let us begin by briefiy reviewing the basic form of stochastic gradient descent (SOD). 
Stochastic gradient descent algorithms iteratively update a parameter vector 0* over time based 
on randomly sampled gradient information. Specifically, at iteration t, a sample Xt is drawn at 
random from the distribution P (or, in the case of a finite set of data {Xi, . . . ,Xn}, a sample Xt 
is chosen from the data set). The method then performs the following two steps: 



e^H =0^- r]tVf{e^; Xt) and 6 



t+i 



argmm 
6»Ge 



(16) 



Here 774 > is a stepsize, and the first update in (fT6l) is a gradient descent step with respect to the 
random gradient V f{9^; Xt). The method then projects the intermediate point 9^~^2 back onto the 
constraint set (if there is a constraint set). The convergence of SGD methods of the form (jl6p 
has been well-studied, and we refer the reader to the papers l24i | for deeper investigations. 

To prove convergence of our stochastic gradient-based averaging algorithms, we require the 
following smoothness and strong convexity condition, which is an alternative to the Assumptions iBl 
and [O used previously. 
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Assumption E (Smoothness and Strong Convexity II). There exists a function L : X ^ M_|_ such 
that 

\\V^f{9; x) - V2/(r ; x)\\^ < L{x) \\9 - e*\\^ , 
and E[L2(X)] < < oo. There are finite constants G and H such that 

E[||V/(6';X)||2] < G^ and E[||| VV(6l*; X) < i?^ for each fixed 9 ^ @ . 

In addition, the population function Fq is \- strongly convex over the space Q, meaning that 

V^Fo{9)hMdxd for all 9 GO. (17) 

Assumption |E] does not require as many moments as does Assumption [Cl but it does require each 
moment bound to hold globally, that is, over the entire sample space 0, rather than only in a 
neighborhood of the optimal point 9* . Similarly, the necessary curvature — in the form of the lower 
bound on the Hessian matrix V^-Fq — is also required to hold globally, rather than only locally. 
Nonetheless, Assumption [E] holds for many common problems; for instance, it holds for any linear 
regression problem in which the covariates have finite fourth moments and the domain is compact. 

Moving to a description of the averaged stochastic gradient algorithm (Savgm), it is based on 
the following two steps: 

(1) Given some constant c > 1, each machine performs n iterations of stochastic gradient de- 
scent (fT6]l on its local dataset of n samples using the stepsize i]t = jj, then outputs the 
resulting local parameter 9[. 

(2) The algorithm computes the average 9^ = ^ SI^i ^'i- 

The following result characterizes the mean-squared error of this procedure in terms of the constants 

' 'cH' 



a := max < 4, — > and 13 := max 

^ '2-l/c_ 



A 



ca3/4G3/2 gi/^LCy^ AG + HR 

(C - l)A5/2 I Al/2 + p3/2 



Theorem 3. Under Assumptions\^ and\^ the output 9^ of the Savgm algorithm has mean-squared 
error upper hounded as 

aG^ /32 ^ ^ 



E 



Theorem [3] shows that the averaged stochastic gradient descent procedure attains the optimal 
convergence rate as a function of the total number of observations N = mn. The constant and 
problem-dependent factors are somewhat worse than those in the earlier results we presented in 
Theorems [1] and [21 but the practical implementability of such a procedure may in some circum- 
stances outweigh those differences. We also note that the second term of order 0{n~^^'^) may be 
reduced to 0{n^'^~'^^^/^) for any /c > 4 by assuming the existence of kth moments in Assumption [El 
we show this in passing after our proof of the theorem in Appendix ICl It is not clear whether a 
bootstrap correction is possible for the stochastic-gradient based estimator; such a correction could 
be significant, because the term /S^/ji^/^ arising from the bias in the stochastic gradient estimator 
may be non-trivial. We leave this question to future work. 
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4 Performance on synthetic data 



In this section, we report the results of simulation studies comparing the AvGM, Bavgm, and Savgm 
methods, as well as a trivial method using only a fraction of the data available on a single machine. 
For each of our simulated experiments, we use a fixed total number of samples N = 100,000, but we 
vary the number of parallel splits m of the data (and consequently, the local dataset sizes n = N/m) 
and the dimensionality d of the problem solved. 

For our experiments, we simulate data from one of two regression models: 

y={u,x)+e or (19a) 
y = {u,x) + h{x)\e\, (19b) 

where e ~ A^(0, 1), and /i is a function to be specified. Specifically, the data generation procedure is 
as follows. For each individual simulation, we choose fixed vector u G with entries Ui distributed 
uniformly in [0, 1]. The data samples consist of pairs (x,y), where x G and ?/ € M is the target 
value. To sample each x vector, we choose five distinct indices in {1, . . . ,d} uniformly at random, 
and the entries of x at those indices are distributed as A^(0, 1). In the sampling scheme (jl9ap . the 
target y is sampled according to a standard normal regression model, so e ^ N{0, 1) is independent 
normal noise. We use the heteroskedastic sampling scheme ()19bp to better understand the effects of 
bootstrap sampling in the Bavgm algorithm, as the standard least-squares estimator is unbiased for 
the model (jl9ap . In the heteroskedastic model ()19bp . we still sample e independently from A^(0, 1), 
but we set h{x) = Yli=ii^i/'^)^ ^ which is mean zero but forces the noise in y to be dependent on 
the data x. 

Regardless of the sampling strategy, in our simulation experiments we use the least-squares loss 

f{9;{x,y)) :=^{{e,x)-yf 

to estimate the vector u. The goal in each experiment is then to estimate the vector 9* that 
minimizes Fq{6) := E[/(0; (X, Y))]. For each simulation, we generate samples according to either 
the model ^jW^i or (pit]) . For each m G {2,4,8,16,32,64,128}, we estimate 9* = argmin^ ^0(6*) 
using a parallel method with data split into m independent sets of size n = N/m, specifically 

(i) The AvGM method 

(ii) The Bavgm method with several settings of the subsampling ratio r (we give specific values 
in the sequel) 

(iii) The Savgm stochastic method using stepsize rjt = d/{10{d + 1)), which gave empirically good 
performance. 

In addition to (i)-(iii), we also estimate 9* with 

(iv) The empirical minimizer of a single split of the data of size n = N/m 

(v) The empirical minimizer on the full dataset (the oracle solution). 
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Figure 1: The error — 0*||2 versus number of machines, with standard errors across twenty 
simulations, for solving least squares with data generated according to the normal model (jl9ap . 
The oracle least-squares estimate using all N samples is given by the line "All," while the line 
"Single" gives the performance of the naive estimator using only n = N/m samples. 
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(a) d = 20 (b) d = 200 

Figure 2: Comparison of AvGM and Savgm methods as in Figure [T] plotted on logarithmic scale. 
The plot shows ||^ — ^*||2 ~ ll^iv ~ ^*ll2' where 6j\f is the oracle least-squares estimator using all N 
data samples. 



4.1 Averaging methods 

For our first set of experiments, we study the performance of the averaging methods (AvGM and 
Savgm), showing their scaling as the number of splits of data — the number of machines m — grows 
for fixed N and dimensions d = 20 and d = 200. We use the standard regression model (jl9ap to 
generate the data, and throughout we let 9 denote the estimate returned by the method under 
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(a) d = 20 (h) d = 200 

Figure 3: The error — 0*||2 versus number of machines, with standard errors across twenty 
simulations, for solving least squares with data generated according to the non-normal model ()19bp . 
The oracle least-squares estimate using all N samples is given by the line "All," while the line 
"Single" gives the performance of the naive estimator using only n = N/m samples. 

consideration (so in the AvGM case, for example, this is the vector 9 := ^i). For the model (jl9ap . 
the population optimal vector 9* is simply u. 

In Figured! we plot the error ||0 — 0* ||| of the inferred parameter vector 9 for the true parameters 
9* versus the number of splits m, or equivalently, the number of separate machines available for 
use. We also plot standard errors (across twenty experiments) for each curve. As a baseline in 
each plot, we plot as a red line the squared error \\9n — 9*\\\ of the centralized "gold standard," 
obtained by applying a batch method to all samples. 

From the plots in Figure [U we can make a few observations. The AvGM algorithm enjoys 
excellent performance, as predicted by our theoretical results, especially compared to the naive 
solution using only a fraction 1/m of the data. In particular, if 9 is obtained by the batch method, 
then AvGM is almost as good as the full-batch baseline even for m as large as 128, though there 
is some evident degradation in solution quality. The Savgm (stochastic-gradient with averaging) 
solution also yields much higher accuracy than the naive solution, but its performance degrades 
somewhat more quickly than the AvGM method's as m grows. In higher dimensions, both the 
AvGM and Savgm procedures have somewhat worse performance; again, this is not unexpected. 

We present a comparison between the AvGM method and the Savgm method with somewhat 
more distinguishing power in Figure [2j For these plots, we compute the gap between the AvGM 
mean-squared-error and the unparallel baseline MSE, which is the accuracy lost due to paralleliza- 
tion or distributing the inference procedure across multiple machines. Figure [2] shows that the 
mean-squared error grows polynomially with the number of machines m, which is consistent with 
our theoretical results. Corollary [2] shows that we expect the AvGM method to suffer (lower-order) 
penalties proportional to as m grows, while Theorem [3] suggests the somewhat faster growth we 
see for the Savgm method in Figure El Thus we see that the improved run-time performance of the 
Savgm method — it requires only a single pass through the data on each machine, touching each 
datum only once — comes at the expense of some loss of accuracy, as measured by mean-squared 
error. 
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Figure 4: The error \\9 — plotted against the number of machines m for the AvGM and Bavgm 
methods, with standard errors across twenty simulations, using the normal regression model ()19ap . 
The oracle estimator is denoted by the line "All." 

In FigureEl we repeat the simulations generating Figured! but we use the non-normal regression 
model (fT9b|) to generate the data. In this case, the non-normality of the model means that the 
vector u generating the data is no longer equal to 9*, so we estimate 0* by solving a least-squares 
regression problem with lOiV the number of data points, and use this to measure the mean-squared 
error ||^ — 0*||2. Broadly, we see that the standard averaging (Avgm) and stochastic (Savgm) 
methods provide a strong benefit over naive solutions, achieving performance close to the gold- 
standard oracle estimator using all the data. 

4.2 Bootstrap correction 

We now turn to developing an understanding of the Bavgm algorithm in comparison to the standard 
average mixture algorithm, developing intuition for the benefits and drawbacks of the method. Be- 
fore describing the results, we remark that for the standard regression model (jl9ap . the least-squares 
solution is unbiased for 6*, so we expect bootstrap averaging to yield little (if any) improvements. 
The Bavgm method is essentially aimed at correcting the bias of the estimator 6i, and de-biasing 
an unbiased estimator only increases its variance. However, for the non-normal model (Il9b|) we do 
expect to see some performance gains. In our experiments, we use multiple sub-sampling rates to 
study their effects, choosing r S {0.005,0.01,0.02,0.04}. We remind the reader that the output of 
the Bavgm algorithm is the vector 9 := {9i — r^2)/(l — r). 

In Figure m we plot the results of simulations comparing Avgm and Bavgm with data generated 
from the normal regression model (jl9ap . Both algorithms have have low error rates, but the Avgm 
method is slightly better than the Bavgm method for both values of the dimension d and all and 
sub-sampling rates r. As expected, in this case the Bavgm method does not offer improvement over 
Avgm. (In Figure [D^a), we note that the standard error is in fact very small, but the mean-squared 
error is only of order 10"'^.) 

To that end, in Figure [5l we plot analogous mean-square error curves for the least-squares re- 
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Figure 5: The error ||^ — ^*||2 plotted against the number of machines m for the AvGM and 
Bavgm methods, with standard errors across twenty simulations, using the non-normal regres- 
sion model ()19bp . The oracle estimator is denoted by the line "All." 



gression problem when the vector y is sampled according to the non- normal regression model ()19b[) . 
In this case, the least-squares estimator is biased for 9* (which, as before, we estimate by solving 
a larger regression problem using 10 data samples). Figure [5] shows that both the AvGM and 
Bavgm method still enjoy good performance; in some cases, the Bavgm method even beats the 
oracle least-squares estimator for 6* that uses all N samples. Both are clearly better than the 
naive estimator based on N/m samples (recall Figure [3]) . Since the AvGM estimate is biased in 
this case, its error curve increases roughly quadratically with m, which agrees with our theoretical 
predictions in Theorem [H In contrast, we see that the Bavgm algorithm enjoys somewhat more 
stable performance, with increasing benefit as the number of machines m increases. For example, 
in case of d = 200, if we choose r = 0.01 for m < 32, choose r = 0.02 for m = 64 and r = 0.04 
for m = 128, then Bavgm has performance comparable with the oracle method that uses all N 
samples. Moreover, we see that all the values of r — at least for the reasonably small values we 
use in the experiment — provide performance improvements over a non-bootstrapped distributed 
estimator. 



5 Experiments with advertising data 

Predicting whether a user of a search engine will click on an advertisement presented to him or 
her is of central importance to the business of several internet companies, and in this section, we 
present experiments studying the performance of the AvGM and Bavgm methods for this task. We 



use a large dataset from the Tencent search engine, soso.com [28!], which contains 641,707 distinct 
advertisement items with N = 235,582,879 data samples. 

Each sample consists of a so-called impression, which in the terminology of the information 
retrieval liter;ture (e.g., see the book by Manning et al. 0) , is a list containing a user-issued 
search, the advertisement presented to the user in response to the search, and a label y £ {+1, — 1} 
indicating whether the user clicked on the advertisement. The ads in our dataset were presented 



16 



Table 1: Features used in online advertisement prediction problem. 



Feature Name 


Dimension 


Description 


Query 


zUUUU 


Word tokens appearing in the query. 


Lxenuer 


Q 
O 


Gender of the user 


Keyword 


zUUUU 


Word tokens appearing in the purchase keywords. 


iitie 


zUUUU 


Word tokens appearing in the ad title. 


Advertiser 


oylyl 


Advertiser's ID 


AdiU 


r' A '\ 'ic\'i 

D4i7U7 


Advertisement's ID. 


Age 


n 

D 


Age of the user 


UserFreq 


z5 


Number of appearance of the same user. 


Position 


6 


Position of advertisement on search page. 


Depth 


6 


Number of ads in the session. 


QueryFreq 


zo 


Number of occurrences of the same query. 


AdFreq 


Zb 


Number of occurrences of the same ad. 


QueryLength 


on 


Number of words in the query. 


i itieLengtn 


on 


Number of words in the ad title. 


DespLength 


^n 


Number of words in the ad description. 


QueryCtr 


150 


Average click-through-rate for query. 


UserCtr 


150 


Average click-through-rate for user. 


AdvrCtr 


150 


Average click-through-rate for advertiser. 


WordCtr 


150 


Average click-through-rate for keyword advertised. 


UserAdFreq 


20 


Number of times this user sees an ad. 


UserQueryFreq 


20 


Number of times this user performs a search. 



to 23,669,283 distinct users. 

Transforming an impression into a useable set of regressors x is non-trivial, but the Tencent 
dataset provides a standard encoding. We list the features present in the data in Table [TJ along with 
some description of their meaning. Each text-based feature — that is, those made up of words, which 
are Query, Keyword, and Title — is given a "bag-of- words" encoding [17[. This encoding assigns 
each of 20,000 possible words an index, and if the word appears in the query (or Keyword or Title 
feature), the corresponding index in the vector x is set to 1. Words that do not appear are encoded 
with a 0. Real- valued features, corresponding to the bottom fifteen features in Table [1] beginning 
with "Age" , are binned into a fixed number of intervals [— oo, oi] , (ai , 02] , . . . , (a^, 00] , each of which 
is assigned an index in x. (Note that the intervals and number thereof vary per feature, and the 
dimension of the features listed in Table [T] corresponds to the number of intervals). When a feature 
falls into a particular bin, the corresponding entry of x is assigned a 1, and otherwise the entries 
of X corresponding to the feature are 0. Each feature has one additional value for "unknown." 
The remaining categorical features — gender, advertiser, and advertisement ID (AdID) — are also 
given {0, 1} encodings, where only one index of x corresponding to the feature may be non-zero 
(which indicates the particular gender, advertiser, or AdID). This combination of encodings yields 
a binary- valued covariate vector x € {0, l}'^ with d = 741,725 dimensions. 

Our goal is to predict the probability of a user clicking a given advertisement as a function of 
the covariates in Table [TJ In order to do so, we use a logistic regression model to estimate the 
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Figure 6: The negative log-likelihood of the output of the AvGM, Bavgm, and a stochastic gra- 
dient method (the update (fT6]) ) on the held-out dataset for the the click-through prediction task. 

(a) Performance of the AvGM and Bavgm methods versus the number of splits m of the data. 

(b) Performance of the SGD baseline as a function of number of passes through the entire dataset. 



probability of a click response 



P{y = l\x;9) :-- 



l + exp(-(0,x))' 



where 6 G M"^ is the unknown regression vector. We use the negative logarithm of P as the 
loss, incorporating a ridge regularization penalty. This combination yields the overall optimization 
problem 

f{9;{x,y)) = log{l + eM-y{0,x))) ^ 



2 

10- 



2 • (20) 
a choice obtained by cross 



|2 
l2' 



as we do not know 



In all our experiments, we use regularization parameter A 
validation. 

For this problem, we cannot evaluate the mean-squared error 
the true optimal parameter 9*. Consequently, we evaluate the performance of an estimate 9 using 
log-loss on a held-out dataset. Specifically, we perform a five-fold validation experiment, where 
we shuffle the data and partition it into five equal-sized subsets. For each of our five experiments, 
we hold out one partition to use as the test set, using the remaining data as the training set for 
inference. When studying the AvGM or Bavgm method, we compute the local estimate 9i via a 
trust-region Newton-based method 21 1. 

The dataset is too large to fit in the memory of most computers: in total, four splits of the 
data require 55 gigabytes. Consequently, it is difficult to provide an oracle training comparison 
using the full samples. Instead, for each experiment, we perform 10 passes of stochastic gradient 
descent through the dataset to get a rough baseline of the performance attained by the empirical 
minimizer for the entire training dataset. Figure [6|^b) shows the hold-out set log-loss after each of 
the sequential passes through the training data finishes. 

In Figure [U^a), we show the average hold-out set log-loss (with standard errors) of the estimator 
9i provided by the AvGM method versus number of splits of the data m, and we also plot the log- 
loss of the Bavgm method using subsampling ratios of r G {.1, .25}. The plot shows that for small 
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Figure 7: The log- loss on held-out data for the Bavgm method applied with m = 128 parallel splits 
of the data, plotted versus the sub-sampling rate r. 



m, both AvGM and Bavgm enjoy good performance, comparable to or better than (our proxy 
for) the oracle solution using all N samples. As the number of machines m grows, however, the 
de-biasing provided by the subsampled bootstrap method yields substantial improvements over the 
standard AvGM method. In addition, even with m = 128 splits of the dataset, the Bavgm method 
gives better hold-out set performance than performing two passes of stochastic gradient on the 
entire dataset of m samples; with m = 64, Bavgm enjoys performance as strong as looping through 
the data four times with stochastic gradient descent. This is striking, as doing even one pass 
throu gh the data with stochastic gradient descent is known to give minimax optimal convergence 



rates 



It is instructive and important to understand the sensitivity of the Bavgm method to the value 
of the resampling parameter r. We explore this question in Figure [7| using m = 128 splits, where 
we plot the log-loss of the Bavgm estimator on the held-out data set versus the subsampling ratio 
r. We choose m = 128 because more data splits provide more variable performance in r. For the 
soso.com ad prediction data set, the choice r = .25 achieves the best performance, but Figure [7] 
suggests that mis-specifying the ratio is not terribly detrimental. Indeed, while the performance 
of Bavgm degrades to that of the AvGM method, a wide range of settings of r give improved 
performance, and there does not appear to be a phase transition to poor performance. 



6 Discussion 



Large scale statistical inference problems are challenging, and the difficulty of solving them will 
only grow as data becomes more abundant: the amount of data we collect is growing much faster 
than the speed or storage capabilities of our computers. Our AvGM and Bavgm methods provide 
strategies for efficiently solving such large-scale risk minimization problems, enjoying performance 
comparable to an oracle method that is able to access the entire large dataset. We believe there are 
several interesting questions that remain open after this work. First, reproducing kernel Hilbert 
space methods, which suffer scaling cubic in the size of the data, may benefit from the distributed 
strategies in this paper, though their analysis is not immediate. More generally, an understanding of 
the interplay between statistical efficiency and communication could provide an avenue for further 
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research, and it may also be interesting to study the effects of subsampled or bootstrap-based 
estimators in other distributed environments. 
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A Proof of Theorem [T] 

Although Theorem [T] is in terms of bounds on S*'* order moments, we prove a somewhat more 
general result in terms of a set of {kQ,ki,k2) moment conditions given by 



E[||V/(e;X)||^'"]<G'^», 



E[\\\V^fi9;X)-V^Fo{9)\\\':']<H''\ 



E[L(X)'=2] < ^fe2 and E[(L(X) - E[L{X)])''^] < L^^ 

for 9 £ U. (Recall the definition of U prior Assumption [C]). Doing so allows sharper control if higher 
moments are known on certain quantities. The reader should recall throughout our arguments that 
we have assumed minjfco, fci, ^2} > 8. Throughout the proof, we use Fi and 9i to indicate the 
local empirical objective and empirical minimizer of machine 1 (which have the same distribution 
as those of the other processors), and we recall the notation l^^^ for the indicator function of the 
event £. 

Before beginning the proof of Theorem [1] proper, we begin with a simple inequality that relates 
the error term 9 — 9* to an average of the errors 9i — 9*, each of which we can bound in turn. 
Specifically, a bit of algebra gives us that 



m9 



E 



^ m 

-E' 



1 1 



= -n\\9i - rill + \\E[9i - 9*]\\l 

< ^E[\\9i-9*\\l] + \\K[9i-9*]\\l. 



(21) 



Here we used the definition of the averaged vector 9 and the fact that for i ^ j, the vectors 9i and 9j 
are independent (they are functions of independent samples). The upper bound (j2ip illuminates the 
path for the remainder of our proof: we bound each of E[||6li - 9*\\^^] and \\E[9i - 9*]\\2. Intuitively, 
since our objective is (locally) strongly convex by Assumption [Bl the empirical minimizing vector 
9i is a nearly unbiased estimator for 9*, which allows us to prove the convergence rates in the 
theorem. 

We begin by defining three events — which we (later) show hold with high probability — that 
guarantee the closeness of 9i and 9*. Intuitively, these events imply that the function Fi looks 
quite similar to the population risk Fq around the point 9*; since Fq is (locally) strongly convex, 
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the minimizer 9i of Fi will be close to 9*. Recall that Assumption [Cl guarantees the existence of a 
ball Up = {e eW^ : \\e - 9*11^ < p} of radius p G (0, 1) such that 



\v'f{e;x)-v^f{e';x)\L<Lix)\\e-e'\ 



2 



for all 9,9' G Up and any x, where E[L(X)*'2] < L^^. In addition, Assumption [B] guarantees that 
V^Fo(0*) >: XI. Now, choosing the potentially smaller radius 6p = mm{p, pX/4L}, we can define 
the three "good" events 



i=l 



£,:=^\\\V^F,i9*)-V^Foi9*)\\\^<P^}, and (22) 
£2:=\\\VF^{9*)\\2< 



{l-_p)X6p 
2 

We then have the following lemma, which states that as long as Fq and Fi are similar near their 
optima, then the distance between 9i € argminggQ Fi(^) and 9* decreases with the norm of the 
gradient ||VFi(0*)||2: more precisely, so long as VFi{9*) is small, then H^i — 6**112 decreases at least 
linearly in || VFi(6l*) Hg. 



Lemma 1. Under the events Sq, £i, and £2 previously defined (|22|) . we have 

p^_0*\\^nVFi{Qh^ and V^F,{9)h{l-p)XId.,. 
(l-p)A 

The proof of Lemma [1] relies on some standard optimization guarantees relating gradients to mini- 
mizers of functions (e.g. 0], Chapter 9), although some care is required since smoothness and strong 
convexity hold only locally in our problem. As the argument is somewhat technical, we defer it to 
Appendix [El 

Our approach from here is to give bounds on E[||6'i - 6l*||2] and ||E[6li - 9*]\\2 by careful Tay- 
lor expansions, which allows us to bound E[||6'i — ^*||2] via our initial expansion ([2T]) . Similar 
Taylor expansions are of course familiar from asymptotic convergence in distribution results for M- 
estimators (e.g. [l^, Chapter 9), but we require more care to control a larger number of moments. 
We begin by noting that whenever the events Sq, £1, and £2 hold, then VFi{9i) = 0, and moreover, 
by a Taylor series expansion of VFi between 9* and 9i, we have 

= VFi{9i) = VFi{9*) + V^Fi{9'){9i - 9*) 

where 9' = k9* + (1 — k)9i for some k G [0, 1]. By adding and subtracting terms, we have 

= VFi(r) + {V^Fi{9') - V^Fi{9*)){9i - 9*) 

+ {V^Fi{9*) - V^Fo{9*)){9i - 9*) + V'^Fq{9*){9i - 9*). (23) 

Since V^i<b(^*) ^ XI, we can define the inverse Hessian matrix := [V'^ Fo{9*)]^^ , and setting 
A := ^1 — 9*, we multiply both sides of the Taylor expansion (|23p by to obtain the relation 

A = -^-^VFi{9*) + ^-^{V^Fi{9*) - V^Fi{9'))A + J:-\V^Fo{9*) - V^Fi{9*))A. (24) 
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Thus, if we define the matrices P = V'^Fo{e*) - V^Fi{e*) and Q = V^Fi{e*) - V'^Fi{e'), equal- 
ity ()24p can be re-written as 

Oi-e* = -s-ivFi(r) + + q){9i - e*). (25) 

Note that equation (|25p holds when the conditions of Lemma [1] hold, and otherwise we may simply 
assert only that ||^i — 9*\\2 < R- Roughly, we expect the final two terms in the error expansion (|25p 
to be of smaller order than the first term, since we hope that — 0* — )■ and additionally that the 
Hessian differences decrease to zero at a sufficiently fast rate. We now formalize this intuition. 

Inspecting the Taylor expansion ()25p . we see that there are several terms of a form similar to 
(^7^^0(6**) — V'^Fi{9*)){9i — 9*); using the smoothness Assumption \C\ we can convert these terms 
into higher order terms involving only 9i — 9*. Thus, to effectively control the expansions (j24p 
and (|25p . we must show that higher order terms of the form E[||0i — ^*||2]5 for k > 2, decrease 
quickly enough in n. 



Control of E[||0i — 6**112]: Recalling the events (f22]l . we define £:= SoH £1(182 and then observe 
that 

Em - r 11^1 = E[i(^) 11^1 - r 11^] + E[i(^c) 11^1 - r 11^] 

2^E[lfcJ|VFi(r)||^l 

^ 2^E[||VFi(r)||^] , 
- (l-p)^-A^ +^(^ )^ ' 

where we have used the bound \\9 — 9*\\2 < R for all ^ € 0, from Assumption [Al Our goal is to 
prove that E[||VFi(r)||^] = 0(n-*^/2) and that ¥{£") = 0(n-^l'^). 

We move forward with a few somewhat technical (but short) lemmas that lay the groundwork 
for proving these two facts. 

Lemma 2. Under Assum'ption\^ there exist constants C and C (dependent only on the moments 
k() and ki respectively) such that 



E[||VFi(r)||^°]<C-^, and (26a) 
E[\\\V^Fi{9*) - V^Fo{9*)\f'] < g/ lQg''^y^)-H'^\ (26b) 

See Appendix |F] for the proof of this claim. As an immediate consequence of Lemma [2l we see 
that the events £1 and £2 defined by (j22p occur reasonably high probability. Indeed, recalling that 
I? = iSo n iSi n , Boole's law and the union bound imply 

F{£^) = F{£^ U £f U £1) 

< F{£S) + P(^f ) + F{£l) 

nihEtlL{X^) -nnX)])'''] , 2^iE[|||v2Fi(r)-V2Fo(r)f^] 2*^oE[||VFi(r)||^o] 



Qko 



< — — — i J '-Jii 1: — : 1^ Llll i + 

1 lo^^^^(2d)H^ G^o 
^C'2^ + Ci -j^^ + ^0^ ^27) 
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for some universal constants Co,Ci,C2- Consequently, we find that ¥{£^)R'' = 0{R^{n ^^/^ + 
^-k2/2 _|_ ^-fco/2~) £q], ^ g In conclusion, we have proved the following lemma: 

Lemma 3. Let Assumptions\^ and\^ hold. For any G N with k < min{/co, ki,k2}, we have 



] = 0[n 



-k/2 _ 



(1 - p)'=A'= 

where the order statements hold as n — )• +00. 



+ n 



-ko/2 



+ n~ 



RecaU the matrix Q = V^Fi{e*) - V'^Fi{e') defined following equation 1^. The following 
result controls the moments of its operator norm: 



Lemma 4. For k < min{/c2, /ci, A;o}/2, we have E[|||Q|||2] = 0{n~''/^). 



Proof We begin by using Jensen's inequality and Assumption [Q to see that 
< - V III v2/(^'; X,) - v2/(r ; X,)|||' < - V L{XiY 

^ z — i/ III III 'n ^ — 



n 



\k Wo' 

i=l " 1=1 

Now we apply the Cauchy-Schwarz inequality and Lemma [3l thereby obtaining 



E 



< E 



n 



i=l 



E 



\2k 



n 



-k/2 



where we have used Assumption [Cl again. 



□ 



Lemma [3] allows us to control the first term from our initial bound (|2ip almost immediately. 
Indeed, using our last Taylor expansion (j25|) and the definition of the event £ = £q r\ £1 r\ £2, 
have 



n\Oi-e*\\l] 



= E 
< 2E 



i(^) ||-s"VFi(r) + + Q){ei - e*)f} + e[i 



1* 1 1 2-1 



|S"^VFi(6'*)| 



+ 2E 



|S-i(P + Q)(ei-r)||' +P(r)i22 



where we have applied the inequality (a + 6)^ < 2a^ + 26^. Again using this same inequality, then 
applying Cauchy-Schwarz and Lemmas [2] and HI we see that 



s^H^ + Q){Oi - < 2 |||s-i|||2 (e[|||p|||2 11^1 - r ||2] +e 



2 Iri 



rill 



E[|||p|||^]E[||0i - eX] + ^niQWlmoi-e^wl] 



e 



< 2 |||S"^||| 

where we have used the fact that min{/co, fci, k2} > 8 to apply Lemma HI Combining these results, 
we obtain the upper bound 



E[||6'i -6**112] < 2E [||S"^VFi(r 
which completes the first part of our proof of Theorem [TJ 



(28) 
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Control of ||E[6'i - 6l*]||2: It remains to consider the ||IE[0i — 6*]\\2 term from our initial error in- 
equality (I2ip . When the events (I22p occur, we know that all derivatives exist, so we may recursively 
apply our expansion (j25p of 9i — 9* to find that 

01-9* = -J:~^VFi{9*) + + Q){9i - 9*) 

= -S~iVFi(r) + + Q) [-S~iVFi(r) + ^-^(P + Q){9i - 9*)] (29) 



where we have introduced v as shorthand for the vector on the right hand side. Thus, with a bit 
of algebraic manipulation we obtain the relation 

9i-9* = \s)v + \e^){9i -9*)=v + \£c){9i - 9*) - \e^^v = v + \£c){9i -9*- v). (30) 

Now note that 

EH = E [-s-ivFi(r) + s-^(p + Q)[-s~ivPi(r) + i:-\p + q){wi - w*)]] 

= E [T.-\P + Q) [S-i(P + Q){9i - 9*) - VFi(r)]] . 

Thus, by re-substituting the appropriate quantities in ([30]) and applying the triangle inequality, we 
have 

l|E[0i-r]|i2 

< ||E[S-i(P + Q)S-i ((P + Q){wi - w*) - VPi(r))]||2 + ||E[l(£c)(0i -9*- v)]\\^ 

< ||E[S-i(P + Q)S-i ((P + QKwi - w*) - VPi(r ))]||2 + E[l(^.) 11^1 - 9*y 

+ E [i(^c) ||-s~ivPi(r) + s-i(p + Q) [-s-VPi(r) + ^-\p + q)(0i - 9*)] || j 

Since — 9*\\2 < P by assumption, we have 



(31) 



(0 



E[l(£c) 11^1 - 0*112] < ¥{£'')R = 0{Rn 



-k/2 



for any k < min{A;2, ki, ko}, where step (i) follows from the inequality (f27|l . Now, Holder's inequality 
also yields that 



E [i(^c) lls-^p + Q)S"ivPi(r)||2] < E \\\^~\p + Q)|||2 ||vPi(r 



< yP(^E |||S~\P + Q) 



1/4. 



E 



|s~VPi(r) 



1/4 



Recalling Lemmas [2] and HI we have E[|||S ^(P -|- Q) III2] = 0{log'^{d)n ^), and we similarly have 

E[||S~iVPi(6'*)||2] = 0{n~^). Lastly, we have = 0{n~'') for k < mm{ko , ki , , whence we 

find that for any such k, 



E [i(^c) \\^~Hp + Q)S^^VPi(r)||2] = O v/bi(d)n 



-fe/4-1 



We can similarly apply Lemma [3] to the last remaining term in the inequality (I31|) to obtain that 
for any k < min{A;2, ki, ko}, 



E [1 



{£-) 



-^-^VFi{9*) + S-i(P + Q) [-S-iVPi(r) + S-i(P + Q){9i - 9*)] 



0(n-'=/2 + n-'=/^-^) 
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Applying these two bounds, we find that 

||K[0i-r]||2 < ||E[S"i(P + Q)S-i((P + Q)K 



w 



+ 0(n~^') (32) 



for any k such that k < uim{kQ, ki, k2}/2 and k < min{A;o, ki, /c2}/4 + 1. 

In the remainder of the proof, we show that part of the bound ()32p stih consists only of higher- 
order terms, leaving us with an expression not involving 9i — 9*. To that end, note that 



E 



by three applications of Holder's inequality, the fact that ||^x||2 < 



and Lemmas [3] and m 



Coupled with our bound (j32p . we use the fact that (a + 6)^ < 2a^ + 26^ to obtain 

||IE[^i - e*]\\l < 2 ||E[S-i(P + Q)^-^VFi{e*)]\\l + 0{n-^). (33) 
We focus on bounding the remaining expectation. We have the following series of inequalities: 

(i) 



E[S-i(P + Q)S-iVFi(r)]||2 < E [|||s-i(p + Q)|||2 Fi{e*)\\^] 



(ii) 

< (E 



(iii) 

< (2E 



|S-i(P + Q)| 



Is-iplll! 



E 



\T,-^VFi{9*) 



K'Q\\\ 



E 



|s-ivFi(r)||2 



Here step (i) follows from Jensen's inequality and the fact that ||Aa;||2 < |||^|||2 \\x\\2' step (ii) uses 
the Cauchy-Schwarz inequality; and step (iii) follows from the fact that (a + 6)^ < 2a^ + 26^. We 
have already bounded the first two terms in the product in our proofs; in particular. Lemma [2] 
guarantees that E[|||P|||2] < CHlogd/n, while 



E 



< E 



1 " 



2n2 



■ n 



for some numerical constant C (recall LemmaU]). Summarizing our bounds on |||P|||2 and 
have 



2' 



we 



|E [S-i(P + Q)E-^VFi(r)] I 

nil 2 /2i/2(logci+l) 



< 2 E" 



+ 2C 



n 



+ 0(71-2) )e ||s-1vFi(6I* 



(34) 



From Assumption [Cl we know that E[||VFi(^*)||2] < G"^ /n and < 1/A, and hence we can 

further simplify the bound (134^ to obtain 



|E[6ii 



2 ^ £ fH^\ogd + L^G^/{l-pf 



A2 



n 



E 



S-VFi( 



+ 0{n 



for some numerical constant C , where we have applied our earlier inequality (]33p . Noting that we 
may (without loss of generality) take p < ^, then applying this inequality with the bound (j28p on 
E[||6'i - 6**112] we previously proved to our decomposition (j2ip completes the proof. 
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B Proof of Theorem [2] 



Our proof of Theorem [2] begins with a simple inequality that mimics our first inequality ()2ip in 
the proof of Theorem [TJ Recall the definitions of the averaged vector 9i and bootstrap averaged 
vector 02. Let 9i denote the minimizer of the (an arbitrary) empirical risk Fi, and 62 denote the 
minimizer of the resampled empirical risk F2 (from the same samples as ^i). Then we have 



E 



rU2 



1 



2" 








< 


E 


2 







rU2 



1 





2 1 






+ — E 






to 





rfc'2 



1 



(35) 



Thus, parallel to our proof of Theorem[Tl it suffices to bound the two terms in the decomposition ([35 
separately. Specifically, we prove the following two lemmas. 



Lemma 5. Under the conditions of Theorem\^ 



E 



rti2 



1 



O 



1 



r(l — r) 



Lemma 6. Under the conditions of Theorem\^ 



E 



< (2 + 3r)E 



|V2Fo( 



+ 0{n 



(36) 



(37) 



In conjunction. Lemmas [5] and [6] coupled with the decomposition (I35p yield the desired claim. 
Indeed, applying each of the lemmas to the decomposition (j35]) . we see that 



E 



rU2 



< 



2 + 3r 



(1 



E 



\v'^FQ{e*y^vFi{e*) 



+ o 



{i-ry 



-m 



+ 



1 



r(l — r)' 



which is the statement of Theorem [2j 

Lemma [6] is quite similar to the results used in the proof of Theorem [T] Lemma [5] is more subtle 
and key to our argument, as it shows that by including the rate-r-subsampled estimates 62 and 
correcting using the bootstrap-estimated bias r(^i — ^2)1 we can obtain an estimate of 9* which has 
expectation of smaller order than was possible in Theorem [TJ 

With this set-up, our proof now proceeds in three steps. First, we give asymptotic expansions 
of the differences 9i — 6* and 62 — 0*. These expansions allow us to show a key result on the bias- 
correction provided by the bootstrap, since the expansions 02 — 0* and 61 — 0* turn out to have only 
second-order differences, which yields Lemma [5l We prove Lemma [6] using techniques analogous to 
those for Theorem [H but we defer its (somewhat more technical) proof to Appendix [Gl 

Throughout the rest of the proof, we use the notation 

Y = Y' + nk 

for some random variables Y and Y' to mean that there exists a random variable Z such that 
Y = Y' + Z and E[||Z||^] = 0(n-'=)Q The symbol TZk may indicate different random variables 



^ Formally, in our proof this will mean that there exist random vectors Y, Y' , and Z that are measurable with 
respect to the cr-field a{Xi, X„), where Y = Y' + Z and E[\\Z\\l] = 0{n~''). 
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throughout a proof and is notational shorthand for a moment-based big-0 notation. We also 
remark that if we have E[||Z||2] = 0{a^n~^), we have Z = a^/'^TZk-, since (a'^/^)^ = of'. For 
shorthand, we also say that E[Z] = 0{h{n)) if ||E[Z]||2 = 0{h{n)), which implies that ii Z = TZk 
then E[Z] = 0{n-''/^), since 



mZ]\\,<^E[\\Z\\l]=0{n-'^/'). 
B.l Optimization Error Expansion 

In this section, we derive a sharper asymptotic expansion of the optimization errors 9i — 9* . Recall 
our definition of the Kronecker product (8), where for vectors u,v we have u0 v = uv^ . With this 
notation, we have the following expansion of 9i — 9*. 

Lemma 7. Under the conditions of Theorem\^ we have 

9i-9* = -S~^VFi(6i*) + T.~'^{V^Fi{9*) - S)S~^VFi(6'*) 

- S~^V^Fo(^*) ((S~^VFi(0*)) (S~^VFi(0*))) +7^3. (38) 

We provide a proof of Lemma [7] in Appendix [Hi The lemma requires somewhat more careful 
moment control over the expansion 9i — 9*, which leads to some technical difficulty, but is similar 
in spirit to the results used to prove Theorem [TJ 

An immediately analogous result to Lemma [7] follows for our sub-sampled bootstrap estimators. 
Since we use rn samples to compute 92, the second level estimator, we find 

Lemma 8. Under the conditions of Theorem\^ we have 

92-9* = -T,~^VF2{9*) + Y,^^{V'^F2{9*) - T,)T,-^V F2{9*) 

- J:~^V^Fo{9*) ((S-^VF2(r)) {^~^VF2{9*))) + r-^^^JZs. 

B.2 Bias Correction 

Now that we have given Taylor expansions that describe the behavior of 9i — 9* and 92 — 9*, we can 
prove Lemmas El and El (though, as noted earlier, we defer the proof of Lemma [6] to Appendix [G]). 
The key insight is that expectations of terms involving Vi^2(^*) are nearly the same as expectations 
of terms involving VFi{9*), except that some corrections for the sampling ratio r are necessary. 
We begin by noting that 

9i-r92 9,-9* 92-9* 

r— . (39) 



1 — r 1 — r 1 — r 

In Lemmas [7] and El we derived expansions for each of the right hand side terms, and since 

E[^-^VFi{9*)] = and E[^-^VF2{9*)] = 0, 

Lemmas [3 and El coupled with the rewritten bootstrap correction (j39p yield 

E[0i -9* - r{92 - 9*)] = rE[S~^(v2F2(0*) - S)S~^VF2(0*)] 

- K[T,-^(y'^Fi{9*) - S)S"^VFi(6l*)] 

+ rE[^-^V^Fo{9*) ((S-iVF2(r)) ® (S-^VF2(r )))] 

- E[J:-^V^Fo{9*) ((S"^VFi(r)) ^ {^~^VFi{9*)))] 
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Here the remainder terms follow because of the r~^/^7^3 term on 62 — 9*. We can now give our 
proof of Lemma [5l 

Proof of Lemma [5] To prove the claim in the lemma, it suffices to show that 



rE[i:-\v'^F2{e*) - S)S-^VF2(6l*)] = E[S"^(V2Fi(6I*) - S)S"^VFi(6'*)] 



(41) 



and 



(42) 



rE[j:~^v^Fo{e*) ((s-^vF2(r)) (S"ivF2(r)))] 
= E[^~^v^Fo{e*) ((S"ivFi(r)) (s~ivFi(r)))] 

Indeed, these two claims combined with the expansion (I40p yield the bound ()36p in Lemma [5] 
immediately. 

We first consider the difference ()4ip . To make things notationally simpler, we define functions 
let A: X R'^'"^ and B : X M.'^ via A{x) := T.'^ f {6* ; x) - S) and B{x) := T^-^V f {6* ; x) . 
If we let Si = {Xi, . . . , Xn} be the original (non-bootstrap) samples and 52 = {Yi, . . . , Ym} be the 
subsampled dataset, we must show 

n 



rE 



I, J 



But since the Yi are sampled without replacement, and E[A(Xj)] = and E[i?(Xj)] = 0, we find 
that E[A{Yi)B{Yj)] = for i / j, and thus 



J2E[A{Y,)BiY,)] =J^E[A{Yi)B{Y,)] = rnE[A{Yi)B{Yi)]. 

i,j i=l 

In particular, we see that the equality ()4ip holds: 

rn ^ ^ n 

Y,m{Y^)B{Y,)] = -E[AiYi)BiYi)] = -E[A{Xi)B{Xi)] = - E[A(X,)i?(X,)]. 



(rn)^ 



The statement (|42p follows from completely parallel arguments; we omit details for brevity. □ 



C Proof of Theorem [3] 

We begin by recalling that if ^" denotes the output of performing stochastic gradient on one 
machine, then from the inequality (I2ip we have the upper bound 



E[\\T - 9*\\l] < ^iE[||r - rii^] + ||E[r - 0*]\\l 

To prove the error bound ([TH]) . it thus suffices to prove the inequalities 



E[||r-r||2l < and (43a) 

||Er-r]||^<-^. (43b) 



n 
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Before proving the theorem, we introduce some notation and a few preliminary results. Let gt = 
V f{9^; Xt) be the gradient of the t''^ sample in stochastic gradient descent, where we consider 
running SGD on a single machine. We also let 



Il{v) := argmin < \\9 — v\\2 \ 



denote the projection of the point v onto the domain Q. 

We now state a lemma due to Rakhlin et al. [2j], which gives sharp rates on the convergence 
of the iterates {0*} in stochastic gradient descent. 

Lemma 9 (Rakhlin, Shamir, Sridharan). Assume that ¥,[\\gt\^ < for all t. Choosing Vt = jj, 
for some c > 1/2, for any t G N we have 



E 



,21 . 



12 



< where a = max(4, c/(2 — 1/c)). 



With these ingredients, we can now turn to the proof of Theorem [3l Lemma [9] gives the 
inequality (|43ap . so it remains to prove that 6 has the smaller bound (j43bp on its bias. To that 
end, recall the neighborhood C/p C © in Assumption [El and note that 

Qt+i - y^tgt - e*) 

= e'- T^tgt -e* + i(e*+i0c/,) (n(e* - mgt) - (0' - vtgt)) 

since when 9 E Up, we have Il{9) = 9. Consequently, an application of the triangle inequality gives 

||E[e*+i - r]||2 < \\E[9' - T^tgt - 9*]\\^+E[\\iUi9' - vtgt) - (9' - r]tgtm9'+' i Up)\^. 

By the definition of the projection and the fact that 0* € 0, we additionally have 

||n(0* - ^tgt) - {9' - r]tgt)\\2 < \\9' - (9' - mt)^^ < Vt ■ 

Thus, by applying Holder's inequality (with the conjugate choices (p, q) = (4, |)) and AssumptionlEt 
we have 



E[^*+i-r]||2 < 


\E[9* 


- Vtgt 


< 


\E[9* 


- Vtgt 


< 


\E[9^ 


- Vtgt 


< 


\E[9^ 


- Vtgt 



.4/3 A 3/4 
p2 1 



(44) 



the inequality (j44p following from an application of Markov's inequality. By applying Lemma O 
we finally obtain 

^2 \ 3/4 



|E[0*+i-r]||2 < \\E[9* -7]tgt-9*]\\2 + i]tG(^ 



„^3/4--y5/2 1 

\\n0'-vtgt-9*]\\, + ^^j^.^^. (45) 
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Now we turn to controlling the rate at which 9^ — rjtgt goes to zero. Let ft{-) = /(•; Xt) be short- 
hand for the loss evaluated on the t^^ data point. By defining n = gt - ft{0*) - V'^ft{0*){9^ - 6**), 
a bit of algebra yields 

9t = + v^M9*)io' - e*) + n. 

Since 0* belongs to the u-field of Xi, . . . , Xt-i, the Hessian V^ft{9*) is (conditionally) independent 
of 0* and 

E[gt] = V'Fo{e*)E[e' - e*] + E[n^e^^u,)]+nrtM0^(uJ- (46) 
If 0* £ Up, then Taylor's theorem implies that rt is the Lagrange remainder 

rt = {v^ft{e')-v^ft{e*)){e'-e*), 

where 9' = k9^ + (1 — k)9* for some k G [0, 1]. Applying Assumption lEl and Holder's inequality, we 
find that since 0* is conditionally independent of Xt, 



E 



Irdie^euJ^] < E f {9' ; Xt) - f{9*; Xt)\\\ \\9' - , 
L{Xt)\\9'-9*\\l] = E[L{Xt)mW 



< E 

< LE 



I2 ^io'e^Up) 



< 



aLG^ 
1^' 



On the other hand, when 0* ^ Up, we have the following sequence of inequalities: 



E 



rdie^^uj, < m\\rt\\t]{n9'^Up)) 



3/4 



< {E[\\gt\\t]+E[\\VM9*)\\t]+E[\\V^ft{9*){9^ - 9^)\\t]) (P(&* Up)f' 

< 33/44/^4 + ^4 + ^4^4 (p(0t f/^))3/4 



(Hi) 

< 3{G + HR) 



Here step (i) follows from Holder's inequality (again applied with the conjugates {p,q) = (4, |)); 
step (ii) follows from Jensen's inequality, since (a + 6 + c)^ < 33 (a'' + 54 + ^4). g^^^ g^gp (jjj) follows 
from Markov's inequality, as in the bounds ()44p and (I45p . Combining our two bounds on rt, we 
find that 

„ , aLG2 3a3/4G3/2(G' + //ij) 1 

nWnU < + >- ■ ^. (47) 

By combining the expansion (j46p with the bound ()47p . we find that 



|E[(/ - r]tV^Fo{9*)){9' - 9*) + r]trt]\\^ 

caLG^ ^ 3ca^/^G^/^{G + HR)_ J_ 



\n9'- iMt - 9*] 



Using the earlier bound (|45p . this inequality then yields 



< \\E[{I-r,tV^F^{9*)){9'-9*)]\l + 



A5/2p3/2 



\E[9 



t+i 



1|L< |||/-??iV*Fo(r) 



\EW 



+ 



ca'I^G^I^ (a^l^LG^ ^ AG + HR \ 



^5/2^7/4 I Xl/2^l/A 



.3/2 
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We now complete the proof via an inductive argument using our immediately preceding bounds. 
Our reasoning follows a similar induction given by Rakhlin et al. 2J]. First, note that by strong 
convexity and our condition that |||V^Fo(0*)||| < we have 

\\l-iltV^F^{e*)\\ = 1 - 7?tA„,in(V2Fo(r) < 1 - r/tA 

whenever 1 — r]tH > 0. Define tq = IcH/X] ; then for t > to we obtain 



IE[ 



< {i-c/t)\\E[e'-e*]\L + 



12 ^7/4 ;^5/2 

For shorthand, we define two intermediate variables 



1 ca3/4G3/2 (a^^LG^/^ AG + HR^ 



AV2tl/4 



+ 



,3/2 



at = E( 



and b 



^^3/4^3/2 /«l/4^G'l/2 ^G + HR 



A5/2 



Ai/2 



+ 



38/2 



(48) 



Inequality (j48p then implies the inductive relation a^+i < (1 — c/t)at + hjt^l'^ . Now we show that 
by defining /3 = max{roi2, 6/(c — 1)}, we have at < fi/t^/^. Indeed, it is clear that ai < tqR. Using 
the inductive hypothesis, we then have 



^ (1 - c/t)(3 , b (3{t-l) p{c-l)-b m-V) 
at+i < — h ^T- = — < — — < 



/3 



t3/4 ^7/4 ^7/4 i2 

This completes the proof of the inequality (I43bp . 



t7/4 - + 1)3/4 • 



□ 



Remark If we assume kth moment bounds instead of 4th, i.e. E[||| V^l^**; ^) III2] < ^ and 
IE[||5t||2] < C^, we find the following analogue of the bound (f48l) : 

||e[^*+i - nil < (1 - c/t) \\E[9^ - e*]\\ 



+ 



fc-1 2fc-2 

1 ca k G k 



2k-l 
t — 



3fc-2 

A 



(54Vfc + 1) G + M^I'^HR a^/^LG'^/^ 



2fc-2 



+ 



In this case, if we define 



fc-l 2fc-2 

ca k G k 

3fc-2 

A fc 



(541/fc + 1) G + M^I^HR a^/i^LG^/^ 



2fc-2 
P fc 



+ 



^2/fc 



and /3 = max < tqR, 



c- 1 



2 2fc — 

we have the same result except we obtain the bound ||E[^"' — 6*]\\2 < fi"^ /n~k' 



D The necessity of smoothness 

Here we show that some version of the smoothness conditions presented in Assumption O are 
necessary for averaging methods to attain better mean-squared error than using only the n samples 
on a single processor. We use the loss function (fTO|) . and set no = ^1^=1 ^{Xi=o) to be the count of 
samples. Using 9i as shorthand for ^i.j, we see by inspection that the empirical minimizer 9i is 




^ - i when no < n/2 
1 — otherwise. 
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For simplicity, we may assume that n is odd. In this case, we obtain that 



n 



(no<n/2) 



E 



-1 



4 + 2^ ^ 

1=0 



n\i 1 
i)n~ 2 



n 

n 



(no>n/2) 



- E 

[n/2-] 



n\ n 

iJji 4 ' 2 



, , ln/2} 
1=0 





? n 


C) 


n 2(n — i) 



by the symmetry of the binomial. Adding and subtracting ^ from the term within the braces, 
noting that P{nQ < n/2) = 1/2, we have the equality 



, [n/2] 

noi] = - E 



i=0 



i n 1 

n ~ 2{n - i) ^ 2 



2^ \ i ) 2n(n - i) ' 



i=0 



If Z is distributed normally with mean 1/2 and variance 1/n, then an asymptotic expansion of the 
binomial distribution yields 



n L"/2J 

E 

1=0 



n\ i{n — 2i) 



i J 2n{n — i) 



E 



Z(l - 2Z , 1 

— ^ < z < - 

2-2Z ' - - 2 

> -E Z - 2^2 I < Z < - 
-2 ' - - 2 



+ o(n-V2) 



the final equality following from standard calculations. 



E Proof of Lemma [T] 

We first prove that under the conditions given in the lemma statement, the function Fi is (1 — p)A- 
strongly convex over the ball U := |0 e M"' : \\9 — 9*\\2 < <5p} around 9*. Indeed, fix 7 G [/, then 
use the triangle inequality to conclude that 

|||v2Fi(7) - V^Fo{9*)\\\2 < \\\V^Fi{j) - V^Fi{9*)\\\^ + \\\V^Fi{9*) - V^Foi9*)\\\^ 

<L\\j-9*\\, + P^. 

Here we used Assumption ICl on the first term and the fact that the event £1 holds on the second. 
By our choice of 6p < pX/AL, this final term is bounded by \p. In particular, we have 

V^Fo{9*)hM so v2Fo(7) ^ A/-pA/= (l-p)A/, 

which proves that Fi is (1 — /3)A-strongly convex on the ball U. 

In order to prove the conclusion of the lemma, we argue that since Fi is (locally) strongly 
convex, if the function Fi has small gradient at the point 9* , it must be the case that the minimizer 
9i of Fi is near 9* . Then we can employ reasoning similar to standard analyses of optimality for 
globally strongly convex functions (e.g. [3|], Chapter 9). By definition of (the local) strong convexity 
on the set U, for any 9' £ Q, we have 

Fi{9') > Fi{9*) + {VFi{9*),9' - 9*) + -^^^^ min 1 1 1 r - ^'Hg • 
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Rewriting this inequality, we find that 

min {||r - e'Wl , 6l] < [F,i6') - Fi(r) + {vF,ie*),e' 



(l-p)A 
2 

(1 ~p)\ 



Dividing each side by \\9' — 9*\\2, then noting that we may set 9' = n9i + (1 — n)9* for any k G [0, 1], 
we have 

51 \ ^2[Fi{K9i + {l-K)9*)-Fi{e*)] ^ 2\\VFi{ 



min <; K \\9i - r 1I2 , „, - ) < ' \. „/ " + 



Of course, -Fi(^i) < Fi{9*) by assumption, so we find that for any k G (0,1) we have the strict 
inequality 



^■^^UM rii 1 2||VFi(r)||, ^ 



the last inequality following from the definition of £2- Since this holds for any k G (0,1), if 
11^1 ~ 0*\\2 ^ ^p^ may set k = bpj \\9\ — 0*||2, which would yield a contradiction. Thus, we have 
11^1 ~ 0*\\2 — '^P' o^'^ earlier inequalities, 



Dividing by \\9i — 9*\\2 completes the proof. □ 



F Moment bounds 



In this appendix, we state two useful moment bounds, showing how they combine to provide a 
proof of Lemma [21 The two lemmas are a vector and a non-commutative matrix variant of the 
classical Rosenthal inequalities. We have 



Lemma 10 (Chen et al. [J], Theorem A. 1(2)). Let Xj G M'^^'^ be independent symmetrically 
distributed Hermitian matrices. Then 



E 



i=l 



l/k 



< V2elog(i 



i=l 



1/2 



+ 2elogd E[max|||Xi||n 



i/fc 



Lemma 11 (de Acosta 0], Theorem 2.1). Let k >2 and Xi be a sequence of independent random 
vectors in a separable Banach space with norm ||-|| andE[||Xj||'^] < 00. There exists a finite constant 
Ck such that 



E 



i=l 



Y^X, -E J^X, 



< Ck 



k/2 



J^EOlX.f] + J^EOlX.f] 



i=l 
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Now we turn to our proof Lemma [2j To prove the first bound (j26ap . let 2 < k < ko and note 
that by Jensen's inequahty, we have 



E[||VFi(r)||^] < |||VFi(r)||2-E[||VFi(r)||2]r +2'=-^E[||VFi(r)||2]^ 

Again applying Jensen's inequality, E[||V/(0*; X)||2] < G^. Thus by recalling the definition 
VFiie*) = ^Er=iV/(r;XO and applying the inequality E[\\VFi{e*)\y < E[|| VFi(r ) jj^] V2 < 
n~^^'^G, we see that Lemma [TT] implies 



E 



l|VFi(r)||^] 



k/2 



^2 ElE[||V/(0;X,)||^] + ^ X;E[||V/(r; X, 



i=l 



i=l 



+ 2^-^E[||VFi(r)||2] 



*MI ifc 



if]E[||V/(r;X, 



«y 112J 



+ n-^/2 5^E[||V/(r;X, 
i=l 



«y 112J 



Applying Jensen's inequality again to note that 

fc/2 

y"E[||V.f(r;X,)||^ll 

n 



-Y,n\\^f{o*;X,)\\l]] <-Y,n\\^fie*;X,)\\lf/'<G' 

i=\ J i=l 



completes the proof of the inequality (|26ap . 

The proof of the bound (I26bp requires a very slightly more delicate argument, involving sym- 
metrization step. Define matrices Zi = ^ {V^ f{6*; Xi) - V^Fo{9*)) . If G {±1} are i.i.d. 
Rademacher variables independent of Zi, then for any integer k in the interval [2,^2]; a standard 
symmetrization argument (e.g. [l^ . Lemma 6.3]) implies that 



E 



i=l 



1/k 



< 2E 



i=l 



1/k 



(49) 



Now we may apply Lemma \TT\ since the matrices SiZi are Hermitian and symmetrically dis- 
tributed; by expanding the definition of the Zi, we find that 



E 



1/k 



< 2^J2eAo■gd 



^ 5^E[(V2/(^;X,)-V2Fo(r))2''^' 



n 



i=l 



+ 4elogd ( ?i~'^E[max|||vV(^*;^i) - V^Fo(r)|f ] 



1/k 



Since the X,- are i.i.d., we have 



1 " \ 1/2 

-VE[(V2/(0;X,)-V2Fo(r))2 

i=l 



n-i/^E 
< n-^/^E 



v^f{e*;X)-v^Foie*)y 



1/2 



1/2 
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by Jensen's inequality, since |||^^/^||| = |||^|||^''^ for positive semidefinite A. Finally, noting that 



n-''E 



ma^\\\V^f{9*;Xi)-V'Fo{e*) 



\V^f{e*;X)-V^Fo{ 



completes the proof of the second bound (|26b|) . 



G Proof of Lemma [6] 

The proof of Lemma [6] follows from that of Lemmas [7] and El We first claim that 

Oi-e* = -^-^VFi{e*) + n2 and 02-9* = -^-^VF2{e*) + r-^7^2. 



(50) 



The proofs of both claims similar, so we focus on proving the second statement. Using the inequality 
(a + 6 + c)^ < 3(a^ + b'^ + c^) and Lemma El we see that 



E 



< 3E 



+ 3E 



\^-^v'^Fo{e*) ((s-ivF2(r)) (g) (s-ivF2(r))) 



(51) 

We now bound the first two terms in inequality ()5ip . Applying the Cauchy-Schwarz inequality and 
Lemma [21 the first term can be upper bounded as 



E 



\T,^\V^F2{9*) - S)S~^VF2(6'* 



< (e [|||s-^(v2F2(r) - S)!^] E [||S-iVF2 



1/2 



= (r-2)0(log2(d)n-2) .r-2o(n-2))^/" = r~^0{n~^), 

where the order notation subsumes the logarithmic factor in the dimension. Since V^Fq{6*) : R'^^ — )• 
is linear, the second term in the inequality (|5ip may be bounded completely analogously as it 
involves the outer product S~^VF2(^*) (8> S~^VF2(0*). (Recall the equality (155p in the proof of 
Lemma [71 ) Recalling the bound ()5ip . we have thus shown that 



E 



102 - r + S-iVF2(r)||Jl = r-20(n-2), 



or 02 — 9* = —T,~^V F2{0*) + r~^7^2- The proof of the first equality in equation (|50p is entirely 
analogous. 

We now apply the equalities ()50p to obtain the result of the lemma. We have 



E 



r - r{92 



E 



-^-^VFi{9*) + ri:^^VF2{9*)+n2 



Using the inequality (a + 6)^ < (1 + ??)a^ + (1 + l/rj)b'^ for any > 0, we have 

{a + b + cf <{1 + 7])a^ + (1 + l/??)(6 + cf 

< (1 + ri)a^ + (1 + 1/?7)(1 + a)b^ + (1 + 1/7?)(1 + l/a)c^ 
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for any 77, a > 0. Taking 1] = I and a = 1/2, we obtain (a + 6 + c)^ < 2a^ + 36^ + 6c^, so applying 
the triangle inequality, we have 



E 



= E 
< 2E 



-S-^VFi(0*) + rS-^VF2(0*) + 7^2 



|S"^VFi(6'*)| 



+ 3r^E 



|S"^VF2(e*) 



(52) 



Since F2 is a sub-sampled version of Fi, algebraic manipulations yield 



E 



— E 

rn 



|s~^vFi(r) 



r 



|S~^VFi(6l*) 



Combining equations (I52p and ()53p . we obtain the desired bound (I37p . 



(53) 



H Proof of Lemma [7] 

The proof follows from a slightly more careful application of the Taylor expansion (j23p . The starting 
point in our proof is to recall the success events (j22p and the joint event 8 := 8q r\ £i r\ £2- We 
begin by arguing that we may focus on the case where £■ holds. Let C denote the right hand side of 
the equality (j38p except for the remainder TZ-^ term. By Assumption [Cl we follow the bound ([2^ 
to find that 



E 



'^{8'^) 11^1 - ^*ll2 



so we can focus on the case where the joint event = iSq ^ <Si H 1S2 does occur. 

Defining A = 61 — 6* for notational convenience, on £ we have that for some k G [0, 1], with 
0' = (I - K)e^ + kO* , 

= VFi{e*) + V^Fi{e*)A + V^Fi{e'){A ® A) 
= VFi{9*) + V^Fo{e*)A + V^Fo{e*){A ® A) 

+ (V^Fiie*) - V^Fo{e*))A + {V^Fi{e') - V^Fo{9*)){A A). 

Now, we recall the definition S = V^Fq{6*), the Hessian of the risk at the optimal point, and solve 
for the error A to see that 

A = -S-^VFi(0*) - S'^(v2Fi(e*) - S)A - S~^V^Fi(6'*)(A ^ A) 

+ ^-\V^Fo{e*) - V^Fi{e')){A A) (54) 

on the event £. As we did in the proof of Theorem [H specifically in deriving the recursive equal- 
ity (I29p . we may apply the expansion (I25p A = 9i — 9* to obtain a clean asymptotic expansion 
of A using ([5i|) . Recall the definition P = V'^Fo{9*) — V'^Fi{9*) for shorthand here (as in the 
expansion (I25p . though we no longer require Q). 
First, we claim that 

l(£)(v3Fo(r) - V'Fi{9')){A A) = 7^3. (55) 

Adding and subtracting V^Fi[9*) from the above expression (and dropping l(^) for simplicity) we 
must control 

{V^Fo{9*) - V^Fi{9*)){A A) + {V^Fi{9*) - V^Fi{9')){A A). 
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To begin, recall that lu^fl 



\uv 



U o V 



,. By Assumption [Dl on the event £ we have 



that V^Fi is (1/n) Y^^^^ M(Xi)-Lipschitz, so defining M„ = (1/n) Y^^^-^ M{Xi), we have 



E 



II (V'^Fiie*) - V'^Fiie')) (A ® A)| 



< E 



Mn\\e* -e'W^WA^AWl 



< E [M, 



41 1/4 



E 



3/4 



0{n- 



by Holder's inequality and Lemma [3l The remaining term we must control is the derivative 
difference E[||(V3Fi(6'*) - V^Fo{e*)){A (g, A)\\l]. Define the random vector-valued function G = 
V{Fi — Fq), and let Gj denote its jth coordinate. Then by definition we have 



(V^Fi(0*) - V^Fo(0*))(A® A) = A^(v2Gi(^*))A ••• A ' (V^Gd(0*))A 



Therefore, by the Cauchy-Schwarz inequality and the fact that x'^ Ax < 



12' 



E 



|(V^Fi(r) - V^Fo{9*)){A A)| 



A^(v2Gj(r))A 



|A| 



E 



IV^G. 



1/2 



Applying Lemma [3] yields that E[||A||2] = 0{n ^). By defining the functions g{-;x) := \7f{-;x) 
VFo(-), we can write 



1 



For every coordinate j, the random matrices V^gj(0*; Xj) (i = 1, . . . , n) are i.i.d. and mean zero. By 
AssumptionO we have \\\V^gj{e*; Xi)\\\^ < 2L{Xi), whence E[|||v2gj(0*; Xi)!^] < 2^L^. Applying 
Lemma [TO] (or the technique used to prove inequality ()26bp following Lemma [T0|) . we thus obtain 



E 



|v2G,(r) 



so 



E 



\{V^Fi{e*) - V^Fo{e*)){A (g) A)\\l = 0{d\og{d)n- 



In particular, we have our desired result (j55p . 
Now we claim that 

i(£)V3Fi(r)(A ®A) = v^Fi{e*){{Y.-^vFi{e*)) (g (s-ivFi(r))) + iz^. 

Indeed, applying the expansion (|25p to the difference A = 0i — 0*, we have on £ that 
A A = (S~VFi(r)) (S~VFi(r)) + (S^^PA) (S~^PA) 

- (s-lpA) (s-ivFi(r)) - (s-^vFi(r)) » (s^^pa). 



(56) 
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We can bound each of the second three outer products in the equahty above similarly; we focus on 
the last for simplicity. Applying the Cauchy-Schwarz inequality, we have 



E 



< E 



|S-iVFi(6'*) 



E 



|S"ip(6ii -6** 



From Lemmas [3] and HI we obtain that 



E 



|S~iVFi(6l* 



^^'p{9i - e*)\ 



0(n 



after an additional application of Cauchy-Schwarz for the second expectation. This shows that 

and a similar proof applies to the other three terms in the outer product A® A. Using the linearity 
of V^Fi(0*), we see that to prove the equality ([56]) . all that is required is that 

i(^c)V3Fi(r) ((S"ivFi(r)) ® (s-ivFi(r))) = n^. 

For this, we apply Holder's inequality several times. Indeed, we have 

|i(£c)V3Fi(r) ((s-ivFi(r))»(s-^vFi(r)))f^ 



(57) 



E 



< E[i(fc)]i/^E \\\v^Fi{e*) ((s-^vFi(r)) (s^ivFi(r))) 



1 8/3' 

I2 



3/4 



< E[1(^c)]1/4e 



<E[1(^c)]V4e 



|V=^Fi( 



|V^Fi(6'*)| 



1 8/3 



1 16/3' 



3/4 



1/4 _ 



E 



|S-iVFi(e*)| 



2/4 



For the final asymptotic bound, we used Eq. (j27p to bound E[l(-£:c)], used the fact (from Assump- 
tionlCj) that E[L(A')^] < to bound the term involving V^FiiO*), and applied Lemma[3]to control 
E[||S^-'^VFi(0*)||2]. Thus the equality ((571) holds, and this completes the proof of the equality (i56]l . 
For the final step in the lemma, we claim that 

-1(^)S~^(V2Fi(^*) - S)A = T.-^{V^Fi{e*) - S)S-^VFi(e*) + 7^3• (58) 

To prove (I58p requires an argument completely parallel to that for our claim ([56]). As before, we 
use the expansion (j25p of the difference A to obtain that on 

-S~^(v2Fi(0*) - S)A 

= T.'^{V'^Fi{e*) - S)S~^VFi(6'*) - S"^(v2Fi(6I*) - S)S"^PA. 

Now apply Lemmas [3] and [J] to the final term after a few applications of Holder's inequality. To 
finish the equality ([58]), we argue that l(^s-)i:~^{V'^Fi{e*) - S)S-^VFi(6l*) = 7^3, which follows 
exactly the line of reasoning used to prove the remainder ()57p . 

Applying equalities (f55]l , ([56]) , and ([581) to our earlier expansion ([5^ yields that 

A = l(^) [ - S-^VFi(^*) - S~^(v2Fi(0*) - S)A - S-^V^Fi(e*)(A ® A) 
+ S-i(v3Fo(r) - V^Fi{6')){/^ ® A)] + l(^c)A 
= -S~^VFi(6'*) + S~^(V^Fi(6i*) - S)S~^VFi(6i*) 

- s-iv3Fi(r) ((s-ivFi(r)) ® (s-ivFi(r))) +7e3 + 1(^c)A. 

Of course, E[l(£;c) HAUg] < P(f^)i?^ = 0(n~^), so this yields the desired statement of the lemma. □ 
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